multimodal knowledge
GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation
Multimodal Machine Translation (MMT) has demonstrated that visual information can significantly aid machine translation. However, existing MMT methods struggle to bridge the modality gap: they enforce rigid visual-linguistic alignment and remain confined to inference within the multimodal domains they were trained on. In this work, we construct novel multimodal scene graphs to preserve and integrate modality-specific information, and we introduce GIIFT, a two-stage Graph-guided Inductive Image-Free MMT framework that uses a cross-modal Graph Attention Network adapter to learn multimodal knowledge in a unified fused space and inductively generalize it to broader image-free translation domains. Experimental results on the Multi30K English-to-French and English-to-German tasks demonstrate that GIIFT surpasses existing approaches and achieves state-of-the-art performance, even without images at inference. Results on the WMT benchmark show significant improvements over image-free translation baselines, demonstrating the strength of GIIFT for inductive image-free inference.
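The adapter at the heart of this pipeline is a graph attention layer operating over fused visual and textual scene-graph nodes. Below is a minimal single-head sketch of such a cross-modal GAT adapter in PyTorch; the class name, residual update, and adjacency convention are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalGATLayer(nn.Module):
    """Single-head graph attention over a fused multimodal scene graph."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) node features; visual and textual scene-graph nodes
        # share one graph. adj: (N, N) adjacency with self-loops.
        h = self.proj(x)
        n = h.size(0)
        hi = h.unsqueeze(1).expand(n, n, -1)          # h_i for every pair (i, j)
        hj = h.unsqueeze(0).expand(n, n, -1)          # h_j for every pair (i, j)
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1))).squeeze(-1)
        e = e.masked_fill(adj == 0, float("-inf"))    # attend only along edges
        alpha = torch.softmax(e, dim=-1)              # (N, N) attention weights
        return x + alpha @ h                          # residual adapter update
```

Because attention is masked to graph edges, adding self-loops (for example via `adj.fill_diagonal_(1)`) keeps every softmax row finite.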
LeMoLE: LLM-Enhanced Mixture of Linear Experts for Time Series Forecasting
Lingzheng Zhang, Lifeng Shen, Yimin Zheng, Shiyuan Piao, Ziyue Li, Fugee Tsung
Recent research has shown that large language models (LLMs) can be effective for real-world time series forecasting thanks to their strong natural language understanding. However, aligning time series with the semantic space of an LLM incurs high computational cost and inference complexity, particularly for long-range generation. Building on recent advances in linear models for time series, this paper introduces LeMoLE, an LLM-enhanced mixture of linear experts for precise and efficient forecasting. The approach combines a mixture of linear experts with multiple lookback lengths and a new multimodal fusion mechanism. The mixture of linear experts is efficient because of its simplicity, while the fusion mechanism adaptively combines the experts based on text features from a pre-trained LLM. In experiments, we reexamine the assumption, made by existing time-series LLMs, that time series must be aligned into the LLM's semantic space, and we further analyze their efficiency and effectiveness in forecasting. Our results show that LeMoLE achieves lower prediction errors and higher computational efficiency than existing LLM-based models.
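Concretely, the described design amounts to one linear forecaster per lookback length plus a gate driven by frozen-LLM text features. A hedged PyTorch sketch under those assumptions (layer names and shapes are invented for illustration):

```python
import torch
import torch.nn as nn

class MixtureOfLinearExperts(nn.Module):
    """Linear experts over several lookback lengths, gated by LLM text features."""

    def __init__(self, lookbacks, horizon, text_dim):
        super().__init__()
        self.lookbacks = lookbacks                        # e.g. [96, 192, 336]
        self.experts = nn.ModuleList(
            nn.Linear(L, horizon) for L in lookbacks)     # one linear map each
        self.gate = nn.Linear(text_dim, len(lookbacks))   # text-conditioned gate

    def forward(self, series, text_emb):
        # series: (B, T) with T >= max(lookbacks)
        # text_emb: (B, text_dim), e.g. a frozen LLM's encoding of a prompt
        preds = torch.stack(
            [expert(series[:, -L:])
             for expert, L in zip(self.experts, self.lookbacks)],
            dim=1)                                        # (B, n_experts, horizon)
        weights = torch.softmax(self.gate(text_emb), dim=-1)
        return (weights.unsqueeze(-1) * preds).sum(dim=1) # (B, horizon)
```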
A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model
Yingxue Xu, Yihui Wang, Fengtao Zhou, Jiabo Ma, Shu Yang, Huangjing Lin, Xin Wang, Jiguang Wang, Li Liang, Anjia Han, Ronald Cheong Kin Chan, Hao Chen
Remarkable strides in computational pathology (CPath) have been made with task-agnostic foundation models (FMs) that advance performance across a wide array of downstream clinical tasks. Despite this promising performance, several challenges remain. First, prior works have relied on either vision-only or vision-caption data, disregarding pathology reports and gene expression profiles, which offer distinct knowledge for versatile clinical applications. Second, current pathology FMs concentrate predominantly on the patch level, where the restricted context of patch-level pretraining fails to capture whole-slide patterns. Here we curated the largest multimodal dataset of H&E diagnostic whole-slide images and their associated pathology reports and RNA-Seq data, totaling 26,169 slide-level modality pairs from 10,275 patients across 32 cancer types. To leverage these data for CPath, we propose a novel whole-slide pretraining paradigm, Multimodal Self-TAught PRetraining (mSTAR), which injects multimodal knowledge at the whole-slide context into the pathology FM. The proposed paradigm revolutionizes the pretraining workflow for CPath, enabling the FM to acquire whole-slide context. To our knowledge, this is the first attempt to incorporate multimodal knowledge at the slide level for enhancing pathology FMs, expanding the modeling context from unimodal to multimodal knowledge and from patch level to slide level. To systematically evaluate mSTAR, we conduct extensive experiments, including slide-level unimodal and multimodal applications, across 7 diverse task types and 43 subtasks, the largest spectrum of downstream tasks to date. Across these slide-level applications, mSTAR consistently delivers significant performance gains over SOTA FMs.
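The abstract does not spell out the pretraining objective, but slide-level modality pairs suggest a contrastive alignment between a whole-slide embedding and its paired report or RNA-Seq embedding. A speculative InfoNCE-style sketch, purely to make the slide-level pairing concrete (not the published mSTAR loss):

```python
import torch
import torch.nn.functional as F

def slide_pair_contrastive(slide_emb, pair_emb, temperature=0.07):
    """Symmetric InfoNCE over slide-level modality pairs.

    slide_emb: (B, d) whole-slide embeddings; pair_emb: (B, d) embeddings of
    the paired modality (pathology report or RNA-Seq profile).
    """
    s = F.normalize(slide_emb, dim=-1)
    p = F.normalize(pair_emb, dim=-1)
    logits = s @ p.t() / temperature                      # (B, B) similarities
    targets = torch.arange(s.size(0), device=s.device)
    return 0.5 * (F.cross_entropy(logits, targets) +      # slide -> pair
                  F.cross_entropy(logits.t(), targets))   # pair -> slide
```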
MC-MKE: A Fine-Grained Multimodal Knowledge Editing Benchmark Emphasizing Modality Consistency
Junzhe Zhang, Huixuan Zhang, Xunjian Yin, Baizhou Huang, Xu Zhang, Xinyu Hu, Xiaojun Wan
Multimodal large language models (MLLMs) are prone to non-factual or outdated knowledge, which can manifest as misreading or misrecognition errors due to the complexity of multimodal knowledge. Previous benchmarks have not systematically analyzed how well editing methods correct these two error types. To better represent and correct them, we decompose multimodal knowledge into its visual and textual components: each error type corresponds to a different editing format, which edits a distinct part of the multimodal knowledge. We present MC-MKE, a fine-grained Multimodal Knowledge Editing benchmark emphasizing Modality Consistency. The benchmark facilitates independent correction of misreading and misrecognition errors by editing the corresponding knowledge component. We evaluate three multimodal knowledge editing methods on MC-MKE, revealing their limitations, particularly with respect to modality consistency. Our work highlights the challenges posed by multimodal knowledge editing and motivates further research into effective techniques for this task.
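One way to make the decomposition concrete: represent each multimodal fact as a visual component (image → entity) plus a textual component ((entity, relation) → object), and route each error type to the component it should edit. The structure below is a hypothetical illustration, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass
class MultimodalFact:
    """A multimodal fact split into a visual and a textual component.

    Visual component: image -> entity (misrecognition edits target this).
    Textual component: (entity, relation) -> object (misreading edits target this).
    """
    image_id: str   # reference to the input image
    entity: str     # what the image depicts, e.g. "Eiffel Tower"
    relation: str   # e.g. "located in"
    obj: str        # e.g. "Paris"

def editing_format(error_type: str) -> str:
    """Route an observed error to the knowledge component to be edited."""
    if error_type == "misrecognition":   # image mapped to the wrong entity
        return "visual: image -> entity"
    if error_type == "misreading":       # entity linked to an outdated object
        return "textual: (entity, relation) -> object"
    raise ValueError(f"unknown error type: {error_type!r}")
```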
Distilling Implicit Multimodal Knowledge into LLMs for Zero-Resource Dialogue Generation
Bo Zhang, Hui Ma, Jian Ding, Jian Wang, Bo Xu, Hongfei Lin
Integrating multimodal knowledge into large language models (LLMs) represents a significant advance in dialogue generation. However, effectively incorporating such knowledge in zero-resource scenarios remains a substantial challenge due to the scarcity of diverse, high-quality dialogue datasets. To address this, we propose the Visual Implicit Knowledge Distillation Framework (VIKDF), which enhances LLMs for enriched dialogue generation in zero-resource contexts by leveraging implicit multimodal knowledge. VIKDF comprises two main stages: knowledge distillation, which uses an Implicit Query Transformer to extract visual implicit knowledge from image-text pairs and encode it into knowledge vectors; and knowledge integration, which employs a novel Bidirectional Variational Information Fusion technique to seamlessly integrate the distilled vectors into LLMs. This enables the LLMs to generate dialogues that are not only coherent and engaging but also show a deep understanding of context through implicit multimodal cues, effectively overcoming the limitations of zero-resource scenarios. Extensive experiments on two dialogue datasets show that VIKDF outperforms existing state-of-the-art models in generating high-quality dialogues. The code will be publicly available following acceptance.
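The Implicit Query Transformer resembles a Q-Former-style module: a fixed set of learnable queries cross-attends to frozen image features and emits fixed-size knowledge vectors. A minimal sketch under that reading (dimensions, layer layout, and names are assumptions):

```python
import torch
import torch.nn as nn

class ImplicitQueryTransformer(nn.Module):
    """Learnable queries distill visual implicit knowledge into fixed-size vectors."""

    def __init__(self, n_queries=32, dim=768, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, n_patches, dim) from a frozen vision encoder
        q = self.queries.expand(image_feats.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, image_feats, image_feats)
        q = self.norm1(q + attn_out)
        q = self.norm2(q + self.ffn(q))
        return q   # (B, n_queries, dim) knowledge vectors to inject into the LLM
```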
Multimodal Machine Unlearning
Machine Unlearning is the process of removing specific training data samples and their corresponding effects from an already trained model. It has significant practical benefits, such as purging private, inaccurate, or outdated information from trained models without complete retraining. Unlearning in a multimodal setting presents unique challenges due to the intrinsic dependencies between data modalities and the expensive cost of training on large multimodal datasets and architectures. Current approaches to machine unlearning have not fully addressed these challenges. To bridge this gap, we introduce MMUL, a machine unlearning approach designed specifically for multimodal data and models. MMUL formulates the multimodal unlearning task around three key properties: (a) modality decoupling, which decouples the association between individual unimodal data points within multimodal inputs marked for deletion, rendering them unrelated data points within the model's context; (b) unimodal knowledge retention, which preserves the model's unimodal representation capability post-unlearning; and (c) multimodal knowledge retention, which preserves the model's multimodal representation capability post-unlearning. MMUL is efficient to train and does not require a strongly convex loss. Experiments on two multimodal models and four multimodal benchmark datasets, including vision-language and graph-language datasets, show that MMUL outperforms existing baselines, gaining an average improvement of +17.6 points over the best-performing unimodal baseline in distinguishing deleted from remaining data. In addition, MMUL largely maintains the pre-existing knowledge of the original model post-unlearning, with a performance gap of only 0.3 points compared to retraining a new model from scratch.
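A plausible way to express the three properties is a three-term objective: a repulsion term on deleted pairs plus two distillation terms against the frozen pre-unlearning model. The sketch below assumes a hypothetical model interface with `encode_image`, `encode_text`, and `fuse` methods; it illustrates the property decomposition, not MMUL's actual loss:

```python
import torch
import torch.nn.functional as F

def mmul_style_loss(model, frozen_ref,
                    forget_img, forget_txt, retain_img, retain_txt,
                    lam_uni=1.0, lam_multi=1.0):
    """Three-term sketch of multimodal unlearning.

    Assumed interface: encode_image / encode_text return (B, d) embeddings
    and fuse returns a (B, d) multimodal representation.
    """
    # (a) modality decoupling: make deleted image-text pairs look unrelated
    zi = model.encode_image(forget_img)
    zt = model.encode_text(forget_txt)
    decouple = F.cosine_similarity(zi, zt, dim=-1).mean()

    with torch.no_grad():                     # targets from the original model
        ref_i = frozen_ref.encode_image(retain_img)
        ref_t = frozen_ref.encode_text(retain_txt)
        ref_m = frozen_ref.fuse(retain_img, retain_txt)

    # (b) unimodal knowledge retention
    uni = (F.mse_loss(model.encode_image(retain_img), ref_i) +
           F.mse_loss(model.encode_text(retain_txt), ref_t))

    # (c) multimodal knowledge retention
    multi = F.mse_loss(model.fuse(retain_img, retain_txt), ref_m)

    return decouple + lam_uni * uni + lam_multi * multi
```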
VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering
Yanan Wang, Michihiro Yasunaga, Hongyu Ren, Shinya Wada, Jure Leskovec
Visual question answering (VQA) requires systems to perform concept-level reasoning by unifying unstructured (e.g., the context in question and answer; "QA context") and structured (e.g., knowledge graph for the QA context and scene; "concept graph") multimodal knowledge. Existing works typically combine a scene graph and a concept graph of the scene by connecting corresponding visual nodes and concept nodes, then incorporate the QA context representation to perform question answering. However, these methods only perform a unidirectional fusion from unstructured knowledge to structured knowledge, limiting their potential to capture joint reasoning over the heterogeneous modalities of knowledge. To perform more expressive reasoning, we propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations. Specifically, we interconnect the scene graph and the concept graph through a super node that represents the QA context, and introduce a new multimodal GNN technique to perform inter-modal message passing for reasoning that mitigates representational gaps between modalities. On two challenging VQA tasks (VCR and GQA), our method outperforms strong baseline VQA methods by 3.2% on VCR (Q-AR) and 4.6% on GQA, suggesting its strength in performing concept-level reasoning. Ablation studies further demonstrate the efficacy of the bidirectional fusion and multimodal GNN method in unifying unstructured and structured multimodal knowledge.
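The key structural idea, a QA-context super node bridging the scene graph and the concept graph, can be sketched as one round of message passing over the joint graph. The node indexing, mean aggregation, and GRU update below are illustrative choices, not the paper's exact GNN:

```python
import torch
import torch.nn as nn

class SuperNodeMessagePassing(nn.Module):
    """One round of message passing over the joint graph, where a QA-context
    super node (index 0) bridges scene-graph and concept-graph nodes."""

    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)   # message transform
        self.upd = nn.GRUCell(dim, dim)  # state update

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) node states; adj: (N, N) adjacency in which row/column 0
        # (the super node) connects to every scene node and concept node.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        messages = (adj @ self.msg(x)) / deg   # mean over neighbors
        return self.upd(messages, x)           # GRU-style node update
```

Because the super node is adjacent to both subgraphs, repeated rounds let information flow scene → QA context → concept and back, which is what makes the fusion bidirectional.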
Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation
Likang Wu, Zhi Li, Hongke Zhao, Zhefeng Wang, Qi Liu, Baoxing Huai, Nicholas Jing Yuan, Enhong Chen
Zero-Shot Learning (ZSL), which aims to recognize unseen objects automatically, is a promising paradigm for machines to continuously acquire new real-world knowledge. Recently, the Knowledge Graph (KG) has proven an effective scheme for handling zero-shot tasks with large-scale, non-attribute data. Prior studies typically embed relationships between seen and unseen objects into visual information using existing knowledge graphs to promote cognition of the unseen data. In reality, however, real-world knowledge is naturally multimodal. Compared with ordinary structural knowledge from a graph perspective, a multimodal KG can provide cognitive systems with fine-grained knowledge: text descriptions and visual content can depict more critical details of a fact than knowledge triplets alone. Unfortunately, this multimodal fine-grained knowledge remains largely unexploited because of the bottleneck of feature alignment between modalities. To that end, we propose a multimodal intensive ZSL framework that matches image regions with the corresponding semantic embeddings via a dense attention module and a self-calibration loss. This makes the semantic transfer process of our ZSL framework learn more differentiated knowledge between entities, and frees the model from the performance limitation of relying only on coarse global features. We conduct extensive experiments on large-scale real-world data, and the results clearly demonstrate the effectiveness of the proposed model on standard zero-shot classification tasks.
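The dense attention module can be read as scoring every image region against every class's semantic embedding and pooling with attention, so that local details rather than one global feature drive the class score. A hedged sketch of that region-to-semantics matching (the self-calibration loss is omitted, and all shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def dense_region_attention(region_feats, sem_emb):
    """Match image regions to class semantic embeddings with dense attention.

    region_feats: (B, R, d) local region features; sem_emb: (C, d) semantic
    embeddings of seen + unseen classes derived from a knowledge graph.
    Returns per-class compatibility scores of shape (B, C).
    """
    r = F.normalize(region_feats, dim=-1)
    s = F.normalize(sem_emb, dim=-1)
    sim = torch.einsum("brd,cd->brc", r, s)   # region-class similarities
    attn = torch.softmax(sim, dim=1)          # which regions matter per class
    return (attn * sim).sum(dim=1)            # attention-weighted class scores
```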
Combo of Thinking and Observing for Outside-Knowledge VQA
Qingyi Si, Yuchen Mo, Zheng Lin, Huishan Ji, Weiping Wang
Outside-knowledge visual question answering is a challenging task that requires both acquiring and using open-ended real-world knowledge. Some existing solutions draw external knowledge into the cross-modality space, overlooking the much vaster textual knowledge available in natural-language space; others convert the image to text and fuse it with textual knowledge entirely in natural-language space, abandoning visual features altogether. In this paper, we instead constrain the cross-modality space to coincide with the natural-language space, so that visual features are preserved directly while the model still benefits from the vast knowledge in natural-language space. To this end, we propose a novel framework consisting of a multimodal encoder, a textual encoder, and an answer decoder. This structure allows us to introduce more types of knowledge, including explicit and implicit multimodal and textual knowledge. Extensive experiments validate the superiority of the proposed method, which outperforms the state of the art by 6.17% accuracy. We also conduct comprehensive ablations of each component and systematically study the roles of the various types of knowledge. Code and knowledge data can be found at https://github.com/PhoebusSi/Thinking-while-Observing.
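Architecturally, the answer decoder can attend over the concatenated states of both encoders, so that visual features ("observing") and textual knowledge ("thinking") live in one language space. A minimal sketch of that three-part layout; module names and dimensions are assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class ThinkObserveFusion(nn.Module):
    """Answer decoder attending jointly over a multimodal encoder (question +
    visual features) and a textual encoder (question + retrieved knowledge),
    both projected into one shared language space."""

    def __init__(self, dim=768, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)

    def forward(self, answer_tokens, mm_states, txt_states):
        # answer_tokens: (B, La, dim) embedded answer prefix
        # mm_states: (B, Lm, dim) from the multimodal encoder ("observing")
        # txt_states: (B, Lt, dim) from the textual encoder ("thinking")
        memory = torch.cat([mm_states, txt_states], dim=1)  # shared space
        return self.decoder(answer_tokens, memory)          # (B, La, dim)
```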